Online Story Segmentation of Multilingual Streaming Broadcast News
نویسندگان
چکیده
We present an online story segmentation approach for Broadcast News (BN) that is built upon and integrated into BBN COTS multilingual Broadcast Monitoring System (BMS). We take a discriminative model-based approach, using Support Vector Machines to segment BN transcriptions into thematically coherent stories within the real-time constraints defined by BMS. We extract lexical, topical and story boundary cue features from source language transcriptions, machine translated (MT) English and metadata generated by BMS. We leverage BBN’s Topic Classification technique to extract topic persistence features, and incorporate topic supporting words and topic clusters to encode thematic transitions. Using the discriminative modelbased approach, we get a relative gain of 27.9% on English BN and 22.0% on Arabic BN over a rule-based system. We also demonstrate a relative improvement of 11.8% in segmentation performance using features extracted from MT English compared to Arabic source features. We highlight the impact of topic model training in our story segmentation approach by varying corpus size to achieve a 13.7% relative gain with increase in number of topics.
منابع مشابه
Feature Selection for Trainable Multilingual Broadcast News Segmentation
Indexing and retrieving broadcast news stories within a large collection requires automatic detection of story boundaries. This video news story segmentation can use a wide range of audio, language, video, and image features. In this paper, we investigate the correlation between automatically-derived multimodal features and story boundaries in seven different broadcast news sources in three lan...
متن کاملLarge, Multilingual, Broadcast News Corpora for Cooperative Research in Topic Detection and Tracking: The TDT-2 and TDT-3 Corpus Efforts
This paper describes the creation and content two corpora, TDT-2 and TDT-3, created for the DARPA sponsored Topic Detection and Tracking project. The research goal in the TDT program is to create the core technology of a news understanding system that can process multilingual news content categorizing individual stories according to the topic(s) they describe. The research tasks include segment...
متن کاملTracking topics in broadcast news data
This paper describes a topic tracking system and its ability to cope with sparse training data for broadcast news tracking. The baseline tracker which relies on a unigram topic model. In order to compensate for the very small amount of training data for each topic, document expansion is used in estimating the initial topic model, and unsupervised model adaptation is carried out after processing...
متن کاملPhoneme lattice based texttiling towards multilingual story segmentation
This paper proposes a phoneme lattice based TextTiling approach towards multilingual story segmentation. The phoneme is the smallest segmental unit in a language and the number of phonemes in a language is usually far smaller than the number of words. Furthermore, many phonemes are shared by different languages. These properties make phonemes particularly appropriate for representing multilingu...
متن کاملSpeaker role based structural classification of broadcast news stories
This paper is concerned with automatic classification of broadcast news stories based on speaker roles such as anchor, reporter and others. The story classification is the first step for many related tasks such as browsing, indexing, and summarising the news broadcast. We use broadcast news audio and its automatic speech recogniser transcripts to implement the classification system. It builds o...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2012